Project 1

Note: if you feel like your Pandas skills need a bit of a touch up, check this article out!

The dataset we're going to be using is one of the most realistic retail time-series datasets you'll find out there because, well, it's actual Walmart data. It was made available for a Kaggle competition that you can check out here.

The original data was in a "wide" format to make it smaller in memory, but that format doesn't play well with databases and you won't see it very often in the real world. The most notable changes are that I added a date column to replace the date identifier columns that were previously there, and I made the data smaller by subsetting to the state of Texas.

Another note: If you want to develop and test your code with a smaller dataset (which I'd probably recommend), set sampled in the cell below to True. All of the tests will still pass if your code is correct!

Let's get into it!

Getting our data in the right format

Time-series data has to be collected from some real-world, data-generating process. That means that raw data comes in as a series of observations. Depending on your experience with time-series data, you may be used to data that looks like this:

Date Sales
2022-01-01 23
2022-01-02 45
2022-01-03 12
2022-01-04 67
2022-01-05 89

But if you're in retail, each of those "sales" probably arrived as a JSON record from a point-of-sale system (i.e., a cash register) that looked something like this:

{
    "timestamp": "2022-01-01 12:34:56",
    "product_id": 5,
    "store_id": 12,
    "category_id": 36,
    ...
}

Usually, it's the job of a data engineer to collect all of these records and aggregate them into a nice, tabular format, but it's worth at least having an appreciation for how it's done. So, we're going to start from a mock version of a transactions table.

You can see that this is a DataFrame where each row relates to purchases for an individual item. Here's a little data dictionary:

date is supposed to be a datetime-like object, but you can see that when we loaded it from disk, it was loaded in as a string. Let's convert that column to datetime.
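The conversion is a one-liner with pd.to_datetime. Here's a minimal sketch on a tiny mock table (the column names and id values are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical mini version of the transactions table loaded from disk,
# where the date column comes in as plain strings.
transactions = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "id": ["FOODS_1_001_TX_1", "FOODS_1_001_TX_1", "HOBBIES_1_002_TX_2"],
})

# Parse the string column into proper datetimes.
transactions["date"] = pd.to_datetime(transactions["date"])
print(transactions["date"].dtype)  # datetime64[ns]
```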

Our goal is to transform this dataset into one that's easy to analyze and train models on. For this project, our goal is going to be to work at the daily level. So, our first step is to aggregate our transactions data up to the daily level.

To be more specific, this is what we want it to look like:

You can see that the sales column is really just a daily count of transactions for that particular id.

In the cell below, create a DataFrame called data, which is the transactions DataFrame aggregated to the daily level. It should look like the above, except you won't have zero-sales days. Don't worry about order: the test below will handle that!
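If you're stuck, here's one possible approach sketched on a toy table (the ids are made up): since sales is just a count of transactions per date/id, a groupby plus .size() gets you there.

```python
import pandas as pd

# Mock transactions: one row per individual purchase (hypothetical ids).
transactions = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02"]),
    "id": ["FOODS_1_001_TX_1", "FOODS_1_001_TX_1", "HOBBIES_1_002_TX_2", "FOODS_1_001_TX_1"],
})

# One row per date/id, with sales = number of transactions that day.
data = (
    transactions
    .groupby(["date", "id"])
    .size()
    .rename("sales")
    .reset_index()
)
print(data)
```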

If the cell below runs without error, you did it right!

Optimizing our data

Let's take a look at how our data is being stored in memory.

1.5 GB of data for our purposes is certainly no joke. But how much of that is really necessary?

Most of our data is stored in the least memory efficient format for pandas: strings (objects). Let's fix that.

Hint: check out this page of the pandas documentation that talks about data types.

In the below cell, convert the data types of columns to reduce memory usage as much as possible.
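As a hedged sketch of the idea (on a toy frame, not the real data): repeated strings compress dramatically as the category dtype, and small counts fit in narrower integer types via downcasting.

```python
import pandas as pd

# Toy frame with repetitive strings and small counts (hypothetical values).
df = pd.DataFrame({
    "id": ["FOODS_1_001_TX_1"] * 3 + ["HOBBIES_1_002_TX_2"] * 3,
    "dept_id": ["FOODS_1"] * 3 + ["HOBBIES_1"] * 3,
    "sales": [3, 0, 12, 1, 0, 5],
})

before = df.memory_usage(deep=True).sum()

# Repeated strings -> category; small non-negative counts -> smallest uint that fits.
df["id"] = df["id"].astype("category")
df["dept_id"] = df["dept_id"].astype("category")
df["sales"] = pd.to_numeric(df["sales"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

On the real 1.5 GB DataFrame the same two moves account for most of the savings, since the object columns dominate memory.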

In my solution, I got the final DataFrame down to 90.4 MB, which is about 6% of the original size!

While we're at it, it's worth talking about the best way to store this data on disk. If we saved this as a CSV, it wouldn't maintain any of the data type modifications we just made. Pandas offers a bunch of options for saving DataFrames, but here are the two I'd recommend:

On my local machine, loading our original CSV took ~8.7 seconds, and that only took 0.1 seconds. And our data types were maintained! Nice!
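As one example of a dtype-preserving round trip (this is just an illustration, not necessarily the exact format used above), pandas' pickle serialization keeps categories and downcast integers intact:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small frame with the optimized dtypes we want to survive a save/load cycle.
df = pd.DataFrame({
    "dept_id": pd.Categorical(["FOODS_1", "HOBBIES_1"]),
    "sales": np.array([3, 5], dtype="uint8"),
})

path = os.path.join(tempfile.mkdtemp(), "data.pkl")
df.to_pickle(path)            # unlike to_csv, this serializes dtypes too
roundtrip = pd.read_pickle(path)
print(roundtrip.dtypes)       # category and uint8 -- preserved
```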

Finishing up our data pre-processing

There's one last modification we need to make before our data is ready to go. The way we converted transactions into sales was slightly problematic: now, when a product doesn't sell on a given day, it simply isn't present in our data for that day, rather than appearing as a zero.

That's an issue for our forecasting models, so let's fix it!

First, set your index to columns that the DataFrame is distinct on (date and id).

Now, create a MultiIndex with all combinations of daily dates and ids using pd.MultiIndex.from_product, and use it with .reindex() to fill the gaps in your data.

Finally, fill the resulting NaNs in your dataframe. Hint: it's tempting to use .groupby().fillna(method='ffill') (and backfilling), but unfortunately this method is quite slow on grouped data. I'd recommend manually recreating the categorical columns by splitting the id column on underscores. This cell could take over a minute to run depending on how you implement it!
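Here's a compressed sketch of those three steps on a toy frame (the simplified ids like FOODS_1_001 are hypothetical; the real id strings have more underscore-separated parts, so adjust the split accordingly):

```python
import pandas as pd

# Toy daily data with a gap: FOODS_1_001 has no row on 2022-01-02.
data = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-01"]),
    "id": ["FOODS_1_001", "FOODS_1_001", "HOBBIES_1_002"],
    "sales": [3, 2, 1],
}).set_index(["date", "id"])

# Full grid of every date x id combination.
full_index = pd.MultiIndex.from_product(
    [pd.date_range("2022-01-01", "2022-01-03", freq="D"),
     data.index.get_level_values("id").unique()],
    names=["date", "id"],
)

# Reindex onto the full grid; missing date/id combos become NaN rows.
data = data.reindex(full_index)
data["sales"] = data["sales"].fillna(0)

# Recreate a categorical column by splitting the id string
# (here, everything before the last underscore is treated as the dept).
ids = data.index.get_level_values("id")
data["dept_id"] = ids.str.rsplit("_", n=1).str[0]
print(data)
```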

Exploring our data

Exploratory data analysis is crucial for building the best models.

Before you start this section, though, I would highly recommend that you set the index of your DataFrame to be on both the date and id field (our DataFrame has one row for each date/id combo). It's up to you, but it's good practice!

For this section, find 3-5 insights about the data that you feel are helpful for building models. Specifically, we'll be building models at the date/dept_id level (i.e., a forecast for FOODS_1 on 2011-02-01, 2011-02-02, etc., a forecast for HOBBIES_1 on 2011-02-01, 2011-02-02, etc.)

The only required one is an autocorrelation analysis. Other than that, some ideas are:

Anything goes! Be creative!

Here's an example of plotting the department-level sales for FOODS_1 to get you started:

(
    data
    .groupby(['date', 'dept_id'])
    .sales
    .sum()
    [:, 'FOODS_1']
    .plot()
)
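For the required autocorrelation analysis, Series.autocorr is the lightest-weight tool. Here's a sketch on a synthetic series with a weekly cycle standing in for a department's sales (everything here is simulated, not the real data); a high autocorrelation at lag 7 is the signature of weekly seasonality:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: weekly sinusoid plus noise.
rng = np.random.default_rng(0)
n = 365
dates = pd.date_range("2011-02-01", periods=n, freq="D")
weekly = 10 + 5 * np.sin(2 * np.pi * np.arange(n) / 7)
sales = pd.Series(weekly + rng.normal(0, 1, n), index=dates)

# A spike at lag 7 (vs. lag 3) suggests weekly seasonality.
for lag in [1, 3, 7, 14]:
    print(f"lag {lag}: {sales.autocorr(lag):.2f}")
```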

Training some models!

Finally, we can train some models!

We're going to use the statsforecast library, since it makes training statistical time-series models really easy. There are other great libraries (like darts, which is a more mature package), but I like statsforecast a bit more for these models. Eventually, we'll get to training our own models from scratch.

Here's what you need to do:

  1. Aggregate sales up to the date/dept_id level so each date has 7 distinct records (one for each dept_id).
  2. Convert your aggregated data into the format that statsforecast likes.
  3. Fit and evaluate some models! This part is made straightforward by statsforecast so feel free to fit whatever you want, but focus on models like this one and this one since we discussed them. Their documentation has a quickstart to get going. I provided you with some helper code below to get started.
    • Play around with ARIMA and HoltWinters and tune them using your intuition, then compare them to AutoARIMA and AutoETS, which do the tuning for you!
  4. (If your time permits) try out other libraries! Go try to fit a Prophet model, fit some models using darts and see how they compare, etc.
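Steps 1 and 2 can be sketched like this on a toy frame (values are made up; statsforecast expects long-format data with unique_id, ds, and y columns). The final commented-out step assumes statsforecast is installed:

```python
import pandas as pd

# Toy daily data at the date/dept_id level (hypothetical values).
data = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]),
    "dept_id": ["FOODS_1", "HOBBIES_1", "FOODS_1", "HOBBIES_1"],
    "sales": [10, 4, 12, 3],
})

# Step 1: aggregate to the date/dept_id level (one row per date per dept).
dept_daily = data.groupby(["date", "dept_id"], as_index=False)["sales"].sum()

# Step 2: rename into the long format statsforecast expects.
sf_df = dept_daily.rename(columns={"dept_id": "unique_id", "date": "ds", "sales": "y"})
print(sf_df.head())

# Step 3 (sketch; requires statsforecast):
# from statsforecast import StatsForecast
# from statsforecast.models import AutoARIMA, AutoETS
# sf = StatsForecast(models=[AutoARIMA(season_length=7), AutoETS(season_length=7)], freq="D")
# forecasts = sf.forecast(df=sf_df, h=28)
```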